Abstract

Many Big Data applications in business and science require the
management and analysis of huge amounts of graph data. Previous
approaches for graph analytics such as graph databases and parallel
graph processing systems (e.g., Pregel) either lack sufficient
scalability or flexibility and expressiveness. We are therefore
developing a new end-to-end approach for graph data management and
analysis based on the Hadoop ecosystem, called Gradoop (Graph
analytics on Hadoop). Gradoop is designed around the so-called
Extended Property Graph Data Model (EPGM) supporting semantically
rich, schema-free graph data within many distinct graphs. A set of
high-level operators is provided for analyzing both single graphs and
collections of graphs. Based on these operators, we propose a
domain-specific language to define analytical workflows. The Gradoop
graph store is currently utilizing HBase for distributed storage of
graph data in Hadoop clusters. An initial version of Gradoop has been
used to analyze graph data for business intelligence and social
network analysis.


1 Introduction 

Graphs are simple, yet powerful data structures to model and analyze
relations between real-world data objects. The flexibility of graph
data models and the variety of graph algorithms have made graph
analytics attractive in different domains, e.g., for web information
systems, social networks [20], business intelligence [441152859] or in
the life sciences [17][38]. Entities such as web sites, users or
proteins can be modeled as vertices while their connections are
represented by edges in a graph. Based on that abstraction, graph
algorithms help to rank websites, to detect communities in social
graphs, to identify pathways in biological networks, etc. The graphs
in these domains are often very large with millions of vertices and
billions of edges, making efficient data management and execution of
graph algorithms challenging. For the graph-oriented analysis of
heterogeneous data, possibly integrated from different sources, graphs
should also be able to adequately represent entities and relationships
of different kinds.

Figure 1: Key steps of end-to-end graph analytics

Currently, two major kinds of systems focus on the management and
analysis of graph data: graph database systems and parallel graph
processing systems. Graph database systems, such as Neo4j [5] or
Sparksee H, focus on the efficient storage and transactional
processing of graph data where multiple users can access a graph in an
interactive way. They support expressive data models, such as the
property graph model [35] or the resource description framework [34],
which are suitable to represent heterogeneous graph data. Furthermore,
graph database systems often provide a declarative graph query
language, e.g., Cypher [7] or SPARQL 251, with support for graph
traversals or pattern matching. However, graph database systems are
typically less suited for high-volume data analysis and graph mining
[271 HU [55] and often do not support distributed processing on
partitioned graphs, which limits the maximum graph size to the
resources of a single machine.

By contrast, parallel graph processing systems such as Google Pregel m
or Apache Giraph [2] process and analyze large-scale graph data
in-memory on shared nothing clusters. They provide a tailored
computational model where users implement parallel graph algorithms by
providing a vertex-centric compute function. However, there is no
support for an expressive graph data model with heterogeneous vertices
and edges and high-level graph operators. Parallel in-memory graph
processing is also supported by Apache Spark and its GraphX m
component as well as Apache Flink [ID]. In contrast to Giraph or
Pregel, these systems provide more powerful workflow-based analysis
capabilities based on high-level operators for processing and
analyzing both graph data and other kinds of data. However, these
systems also lack support for permanent graph storage and general data
management features. Furthermore, there is no support for storing and
analyzing many distinct graphs rather than only a single graph.

The discussion shows that the previous approaches for graph data
management and analysis have both strengths and restrictions (more
approaches will be discussed in section [D] on related work). In
particular, we see a lack of support for an end-to-end approach for
scalable graph data management and analytics based on an expressive
graph data model including powerful analytical operators. Furthermore,
we see the need for an advanced graph data model supporting storage
and analysis for collections of graphs, e.g., for graph comparison in
biological applications m or graph mining in business information
networks H31. The approach should also support the flexible
integration of heterogeneous data within a distributed graph store as
illustrated in Fig. 1.

At the German Big Data center of excellence ScaDS Dresden/Leipzig, we
have thus started the design and development of the Gradoop (Graph
Analytics on Hadoop) system for realizing such an end-to-end approach
to scalable graph data management and analytics. Its design is based
on our previous work on graph-based business intelligence with the
so-called BIIIG approach for integrating heterogeneous data within
integrated instance graphs [44][46]. In the ongoing implementation of
Gradoop we aim at leveraging existing Hadoop-based systems (e.g.,
HBase, MapReduce, Giraph) for reliable, scalable and distributed
storage and processing of graph data.

In this paper we present the initial design of the Gradoop
architecture (section 2) and its underlying data model, the so-called
Extended Property Graph Data Model (EPGM, section 3). We also outline
the implementation of the HBase-based graph data store (section 4) and
demonstrate the usefulness of our approach for two use cases
(section 5). Our main contributions can be summarized as follows:

• We present the high-level system design of Gradoop, a new
  Hadoop-based framework for distributed graph data management and
  analytics. Gradoop aims at an end-to-end approach according to
  Fig. 1 with workflow-based integration of source data into a common
  graph store, workflow-based graph analysis as well as the
  visualization of analysis results.

• We propose the powerful yet simple EPGM graph data model. It
  supports both graphs and collections of graphs with heterogeneous
  vertices and edges as well as declarative operators for graph
  analytics. We also show how the operators can be used for the
  declaration of analytical workflows with a new domain-specific
  language called GrALa (Graph Analytical Language).

• We describe the design and implementation of our distributed graph
  store built upon Apache HBase. It supports partitioning, replication
  and versioning of large, heterogeneous graphs.

• We show the applicability of an initial implementation of Gradoop
  using two use cases for social network analysis and business
  intelligence.

Figure 2: Gradoop High-Level Architecture


2 Gradoop Architecture 

With Gradoop we aim at providing a framework for scalable graph data
management and analytics on large, semantically expressive graphs. To
achieve horizontal scalability of storage and processing capacity,
Gradoop runs on shared nothing clusters and utilizes existing
Hadoop-based software for distributed data storage and processing.

Fig. 2 shows the high-level architecture of Gradoop. Gradoop users can
declare data integration and analytical workflows either with a visual
interface or by using GrALa, our domain specific language (DSL).
Workflows are processed by the workflow execution layer which has
access to the actual operator implementations, which in turn access
the distributed graph store using the EPGM graph data model. After
processing a workflow, results are presented to the user. In the
following, we briefly explain the core components and discuss them in
more detail in subsequent sections.

Distributed Graph Store The distributed graph store manages a
persistent graph database structured according to the EPGM graph data
model. It offers basic methods to store, read and modify graph data,
and serves as the data source for operators directly operating on the
permanent graph representation, e.g., when using MapReduce.
Furthermore, the graph output of operators can be written to the data
store.

Efficient graph processing demands a fast retrieval of vertices and
their neighborhood as well as a low-latency communication between
connected vertices. The graph data has to be physically partitioned
across all cluster nodes so that data is equally distributed for load
balancing. Furthermore, the partitioning should support graph
processing with little communication overhead (locality of access).
The Gradoop graph store also manages different versions of the
underlying graph, e.g., to enable the analysis of graph changes over
time. Finally, the graph store needs to be resilient against failures
and avoid data loss.

Our current storage implementation builds on HBase [4], an open-source
implementation of Google's BigTable ITB], running on the Apache HDFS
(Hadoop Distributed File System) [3]. HBase supports high data
availability through redundant data storage and provides data
versioning as well as partitioning. Many Hadoop processing components
(e.g., MapReduce, Giraph, Flink) have built-in support for HBase,
thereby simplifying their use for realizing Gradoop functionality. In
section 4 we describe the Gradoop graph store in more detail.

Extended Property Graph Data Model Our extended property graph data
model (EPGM) describes how graph databases are structured and defines
a set of declarative operators. EPGM is an extension of the property
graph model [49], which is used in various graph database systems. To
facilitate integration of heterogeneous data from different sources,
it does not enforce any kind of schema, but the graph elements
(vertices, edges, logical graphs) can have different type labels and
attributes. For enhanced analytical expressiveness, EPGM supports
multiple logical graphs inside a graph database. Like vertices and
edges, logical graphs are first-class citizens of our data model and
can have their own properties. Furthermore, collections of logical
graphs can be the input and output of analytical operators. The
details of our data model and its analytical operators are discussed
in section 3.


Operator Implementations The EPGM operators need to be efficiently
implemented for use in analytical workflows. This also holds for
further operators that present analysis results or perform data
import, transformation and integration tasks to load external data
into the Gradoop graph store. All these operators have access to the
common graph store and need to be executed in parallel on the
underlying Hadoop cluster. For the operator implementation Gradoop
utilizes existing systems such as MapReduce, Giraph, Flink or Spark,
thereby taking advantage of their respective strengths. For example,
MapReduce is well suited for ETL-like data transformation and
integration tasks while Giraph can efficiently process graph mining
algorithms. We implemented a first set of operators as a proof of
concept and use them in our evaluation in section 5.


Workflow Execution The workflow execution component is responsible for
managing the complete execution of data integration workflows or
analytical workflows. Before a declared workflow can be executed, it
is transformed into an executable program. The workflow execution has
access to the operator implementations and runs and monitors their
execution. Furthermore, it manages intermediate operator results and
provides status updates to the user. At the beginning of a workflow,
as necessary for the first operator, the graph or parts of it are read
from the graph store by the execution system. Intermediate results are
either written to the graph store or are cached in memory by the
execution layer. The latter case will be preferred for high
performance and especially used if two subsequent operators are
implemented in the same system, e.g., Spark or Flink. Final analysis
results are stored or forwarded to the presentation layer.


Workflow Declaration and Result Representation The typical Gradoop
users are data scientists and analysts. They are responsible for the
specification of data integration and analysis workflows and evaluate
the obtained results. For ease of use, workflows can be declaratively
specified by writing a DSL (GrALa) script using the available
operators, which is then handed over to the workflow execution
subsystem. Alternatively, workflows are visually defined using a
browser-based front-end and then automatically transformed into a DSL
script. Workflow results are either represented through graph
visualization (e.g., colored subgraphs or specific layouts) or
combined with charts and tables (e.g., aggregate values for subgraphs
or frequency distributions of graph patterns).

As mentioned in the introduction, the implementation of Gradoop is
still ongoing, so that some of the introduced components, in
particular the workflow execution, the visual workflow definition and
result representation, still need to be completed. The current focus
is on the graph data model, the definition of analytical workflows and
the underlying graph data store. For data integration, we will port
the approaches proposed for BIIIG [44][46] to Hadoop and adapt our
MapReduce-based Dedoop tool for entity resolution [25] to support
graph data.

3 Graph Data Model 

In this section, we introduce the EPGM data model of Gradoop. We first
describe the representation of graph data with EPGM and then present
its analytical operators.

3.1 Graph Representation 

The design of EPGM data representation is based on 
the following requirements that we have derived from 
various analytical scenarios: 

Simple but powerful The graph model should be 
powerful enough to support the graph structures of 
most use cases for graph analytics. On the other 
hand, it should also be intuitive and easy to use. For 
this reason we favor a model with a flat structure of 
vertices and binary edges. 

Logical graphs Support for more than one graph in the data model is
advantageous since many analytical applications involve multiple
graphs, such as communities in social networks or multiple executions
of a business process. These graphs may have common vertices, edges or
subgraphs.


Type labels and attributes Graph data from real-world scenarios is
often heterogeneous, exhibiting multiple types of vertices, edges and
graphs. A graph model should thus support different types and
heterogeneous attributes for all of its elements in a uniform way.
Additionally, the meaning of relationships requires edges to be
directed.

Loops and parallel edges In many real-world scenarios there may be
self-connecting edges or multiple edges having the same incident
vertices, for example, to describe different relationships between
persons. Hence, a graph data model should support loops and parallel
edges.

In its simplest form, a graph $G = (V, E)$ consists of a set of
vertices $V$ and a set of binary edges $E \subseteq V \times V$.
Several extensions of this simple graph abstraction have been proposed
to define a graph data model mm- One of those models, the property
graph model (PGM) [49][50], already meets our requirements in large
parts. The PGM is widely accepted and used in graph database systems
(e.g., Neo4j), industrial research projects (e.g., SAP Active
Information Store [51]) and in parallel processing systems such as
Spark GraphX. A property graph is a directed multigraph supporting
encapsulated properties (named attributes) for both vertices and
edges. Properties have the form of key-value pairs (e.g., name:Alice
or weight:42) and are defined at the instance level without requiring
an upfront schema definition. However, the PGM foresees type labels
only for edges (e.g., knows). It also has no support for multiple
graphs and respective graph type labels and graph properties.
Furthermore, there are no operators on multiple graphs.

To meet all of the posed requirements, we have developed the Extended
Property Graph Model (EPGM). In this model, a database consists of
multiple property graphs which we call logical graphs. These graphs
are application-specific subsets from shared sets of vertices and
edges, i.e., may have common vertices and edges. Additionally, not
only vertices and edges but also logical graphs have a type label and
can have different properties. Formally, we define an EPGM database as

$$DB_{EPGM} = (\mathcal{V}, \mathcal{E}, \mathcal{G}, T, \tau, K, A, \kappa).$$

A (graph) database $DB_{EPGM}$ consists of a vertex space
$\mathcal{V} = \{v_i\}$, an edge space $\mathcal{E} = \{e_k\}$ and a
set of logical graphs $\mathcal{G} = \{G_m\}$. Vertices, edges and
(logical) graphs are identified by the respective indices
$i, k, m \in \mathbb{N}$. An edge $e_k = (v_i, v_j)$ with
$v_i, v_j \in \mathcal{V}$ directs from $v_i$ to $v_j$ and supports
loops (i.e., $i = j$). There can be multiple edges between two
vertices which are differentiated by distinct identifiers. A logical
graph $G_m = (V, E)$ is an ordered pair of a subset of vertices
$V \subseteq \mathcal{V}$ and a subset of edges
$E \subseteq \mathcal{E}$. We use $G_{DB}$ to denote the graph of all
vertices $V(G_{DB}) = \mathcal{V}$ and all edges
$E(G_{DB}) = \mathcal{E}$ of a database. Graphs may potentially
overlap such that
$\forall G_i, G_j \in \mathcal{G} : |V(G_i) \cap V(G_j)| \geq 0 \land |E(G_i) \cap E(G_j)| \geq 0$.
For the definition of type labels we use label alphabet $T$ and a
mapping
$\tau : (\mathcal{V} \cup \mathcal{E} \cup \mathcal{G}) \to T$.
Similarly, properties (key-value pairs) are defined by key set $K$,
value set $A$ and mapping
$\kappa : (\mathcal{V} \cup \mathcal{E} \cup \mathcal{G}) \times K \to A$.

Figure 3: Example EPGM database graph
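To make the formal definition more tangible, the following is a minimal Python sketch of the EPGM structure. It is not Gradoop's implementation: the class names (`Element`, `Edge`, `LogicalGraph`, `Database`) and the sample data are assumptions chosen for illustration. Every element carries a type label and a property map, edges are identified by their own index so that loops and parallel edges are possible, and logical graphs are just subsets of the shared vertex and edge spaces.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Shared shape of vertices, edges and logical graphs: type label + properties."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge(Element):
    source: int = 0  # index i of the source vertex v_i
    target: int = 0  # index j of the target vertex v_j


@dataclass
class LogicalGraph(Element):
    vertex_ids: set = field(default_factory=set)  # subset of the vertex space
    edge_ids: set = field(default_factory=set)    # subset of the edge space


@dataclass
class Database:
    vertices: dict = field(default_factory=dict)  # i -> Element
    edges: dict = field(default_factory=dict)     # k -> Edge
    graphs: dict = field(default_factory=dict)    # m -> LogicalGraph


# Hypothetical sample data (not the database of Figure 3).
db = Database()
db.vertices[0] = Element("Person", {"name": "Alice"})
db.vertices[1] = Element("Person", {"name": "Bob", "age": 30})
db.edges[0] = Edge("knows", {"since": 2013}, source=0, target=1)
db.edges[1] = Edge("likes", {}, source=0, target=1)  # parallel edge, distinct id
db.edges[2] = Edge("knows", {}, source=1, target=1)  # loop (i = j)
db.graphs[0] = LogicalGraph("Community", {"vertexCount": 2}, {0, 1}, {0, 1})
db.graphs[1] = LogicalGraph("Community", {"vertexCount": 1}, {1}, {2})

# Logical graphs may overlap because they reference shared vertices.
shared = db.graphs[0].vertex_ids & db.graphs[1].vertex_ids
print(shared)
```

Storing only identifiers inside each logical graph keeps vertices and edges shared rather than copied, which mirrors the idea that logical graphs are subsets of one vertex space and one edge space.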

Figure 3 shows an example EPGM database graph $G_{DB}$ for a simple
social network. Formally, $G_{DB}$ consists of the vertex space
$\mathcal{V} = \{v_0, .., v_{10}\}$ and the edge space
$\mathcal{E} = \{e_0, .., e_{23}\}$. Vertices represent persons,
forums and interest tags, represented by corresponding type labels
(e.g., Person) and further described by their properties (e.g.,
type:Tag or name:Alice). Edges describe the relationships between
vertices and also have type labels (e.g., knows) and properties (e.g.,
since:2013). The key set $K$ contains all property keys, for example,
name, title and since, while the value set $A$ contains all property
values, for example, "Alice", "Graph Databases" and 2015. Vertices
with the same type label may have different property keys, e.g., $v_4$
and $v_5$.

Furthermore, the sample database contains the logical graph set
$\mathcal{G} = \{G_0, G_1, G_2\}$, where each logical graph represents
a community inside the social network, in this case, specific interest
groups (e.g., Graph Databases). Such groups can be found by
application-specific subgraph detection algorithms. In this example,
users will be part of a community if they are a member of a forum that
is tagged by a specific topic or have direct interest in that topic.
Each logical graph has a specific subset of vertices and edges, for
example, $V(G_0) = \{v_0, v_1, v_4\}$ and
$E(G_0) = \{e_0, e_3, e_6, e_{21}\}$. Considering $G_0$ and $G_1$, it
can be seen that vertex sets may overlap such that
$V(G_0) \cap V(G_1) = \{v_0, v_1\}$. Additionally, graphs also have
type labels (e.g., Community) and may have properties, which can be
used to describe the graph by annotating it with specific metrics
(e.g., vertexCount:3) or general information about that graph (e.g.,
interest:Databases). Usually, logical graphs are the result of an
operator executed in an analytical workflow. If they need to be
re-used, logical graphs can be persisted in the graph store.

3.2 Operators 

The EPGM provides operators for both single graphs and collections of
graphs; operators may also return single graphs or graph collections.
In the following, we use the terms collection and graph collection
interchangeably. In Gradoop, collections are ordered to support
application-specific sorting of collections and position-based
selection of graphs from a collection. Table 1 lists our analytical
operators together with the definitions of their input and output. The
table also shows the corresponding syntax for calling the operators in
our domain specific language GrALa (Graph Analytical Language).
Inspired by modern programming languages, we use the concept of
higher-order functions in GrALa for several operators (e.g., to use an
aggregate or a predicate function as an operator argument). Based on
the input of operators, we distinguish between collection operators
(shown in the top part of Table 1) and graph operators as well as
between unary and binary operators (single graph/collection vs. two
graphs/collections as input). There are also some auxiliary operators
to apply graph operators on collections or to call specific graph
algorithms. In addition to the listed operators, GrALa provides basic
operators to create, read, update and delete graphs, vertices and
edges as well as their properties. Since we can store older versions
of graphs in the Gradoop graph store, we can read different versions
of graphs and their elements, e.g., for time-based analytics. In the
following, we discuss the Gradoop operators in more detail.

Collection Operators Collection operators can be 
applied on collections of graphs, vertices and edges. 
In the following, we focus on graphs as their usage for 
vertices and edges is analogous. 

The selection operator
$\sigma_\varphi : \mathcal{G}^n \to \mathcal{G}^n$ for collections
selects the graphs from the input collection that meet a user-defined
predicate function $\varphi : \mathcal{G} \to \{true, false\}$. The
output is a collection with all qualifying graphs. Algorithm 1 shows
two examples using the selection operator in GrALa. We first define
the input collection (line 1) of three logical graphs identified by
their unique id (e.g., db.G[0] corresponds to $G_0$) and assign it to
the variable collection. The db object is a reference to the database
graph $G_{DB}$. The first user-defined predicate function (line 2)
will evaluate to true if the input graph g has a value greater than 3
for property key vertexCount, i.e., we want to find all graphs with
more than three vertices. In line 3, we call the select operator to
apply the predicate function on the predefined collection. For our
example graph in Figure 3, the result collection only contains
db.G[2].


Algorithm 1 Selection Example
1: collection = <db.G[0],db.G[1],db.G[2]>
2: predicate1 = (Graph g => g["vertexCount"] > 3)
3: result1 = collection.select(predicate1)
4: predicate2 = (Graph g => g["vertexCount"] ==
       g.V.select(Vertex v =>
           v["age"] > 20).count())
5: result2 = collection.select(predicate2)


The second example shows that predicates are not limited to graph
properties but can be specified by complex functions on graph vertices
or edges. The predicate function defined in line 4 allows us to select
all graphs where all vertices have a property age with a value greater
than 20. To achieve that, we access the vertices of the particular
graph (i.e., g.V) and apply a further predicate which uses the
selection operator on the vertices. The vertex selection determines
all vertices satisfying the age condition. The predicate then
evaluates to true if the number (i.e., count()) of the resulting
vertices equals the total number of vertices stored in property
vertexCount. The count function is a predefined aggregate function and
will be discussed below. For our example graph, the resulting
collection in line 5 only contains db.G[1].
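The two selection examples rely on predicates being passed as higher-order functions. A rough Python analogue is sketched below; the `select` helper and the toy graphs are assumptions made for illustration and do not reproduce the data of Figure 3 or Gradoop's operator implementation.

```python
def select(collection, predicate):
    """Return the members of an ordered collection that satisfy the predicate."""
    return [item for item in collection if predicate(item)]


# Hypothetical toy data: each graph has a vertexCount property and a vertex list V.
graphs = [
    {"id": 0, "vertexCount": 3, "V": [{"age": 25}, {"age": 19}, {"age": 31}]},
    {"id": 1, "vertexCount": 3, "V": [{"age": 22}, {"age": 35}, {"age": 41}]},
    {"id": 2, "vertexCount": 4,
     "V": [{"age": 30}, {"age": 20}, {"age": 45}, {"age": 28}]},
]

# Analogue of predicate1: graphs with more than three vertices.
predicate1 = lambda g: g["vertexCount"] > 3
result1 = select(graphs, predicate1)

# Analogue of predicate2: the vertices older than 20 must be all vertices.
predicate2 = lambda g: g["vertexCount"] == len(
    select(g["V"], lambda v: v["age"] > 20)
)
result2 = select(graphs, predicate2)

print([g["id"] for g in result1], [g["id"] for g in result2])
```

The same `select` function is reused for graphs and for vertices, which matches the statement that collection operators apply analogously to collections of graphs, vertices and edges.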

As shown in Table 1, we also support the set-theoretical operators
union, intersection and difference on collections. For example, the
call <db.G[0],db.G[1]>.intersect(<db.G[1],db.G[2]>) results in a
collection <db.G[1]>.


Algorithm 2 Sort and Top Examples
sortedColl = db.G.sortBy("vertexCount",:desc)
topGraphs = sortedColl.top(2)


Furthermore, there are operators for eliminating duplicate graphs in
collections based on their index (distinct operator), for sorting
collections (sort) and for selecting the first $n$
($n \in \mathbb{N}$) graphs in a collection (top). The sort operator
returns a collection sorted by a graph property $k$ in either
ascending or descending order, denoted by $o$. Algorithm 2 shows the
usage of sort and top in GrALa.
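The remaining collection operators can be sketched in a few lines of Python on ordered lists. The helper names (`sort_by`, `top`, `distinct`, `intersect`) and the sample graphs are hypothetical; graph identity is assumed to be decided by the graph id, as the text states for the distinct operator.

```python
def sort_by(collection, key, descending=False):
    """Sort graphs by the value of one graph property."""
    return sorted(collection, key=lambda g: g[key], reverse=descending)


def top(collection, limit):
    """Return the first `limit` graphs of an ordered collection."""
    return collection[:limit]


def distinct(collection):
    """Drop repeated graphs (same id), keeping the first occurrence."""
    seen, result = set(), []
    for g in collection:
        if g["id"] not in seen:
            seen.add(g["id"])
            result.append(g)
    return result


def intersect(left, right):
    """Keep the graphs of `left` whose id also occurs in `right`."""
    right_ids = {g["id"] for g in right}
    return [g for g in left if g["id"] in right_ids]


# Hypothetical graphs, reduced to an id and one property.
g0 = {"id": 0, "vertexCount": 3}
g1 = {"id": 1, "vertexCount": 2}
g2 = {"id": 2, "vertexCount": 4}

sorted_coll = sort_by([g0, g1, g2], "vertexCount", descending=True)
top_graphs = top(sorted_coll, 2)
common = intersect([g0, g1], [g1, g2])
unique = distinct([g0, g1, g0])

print([g["id"] for g in top_graphs], [g["id"] for g in common])
```

Because collections are ordered, `top` only makes sense after a sort or another operator that fixes the order, which is why the two calls are chained in Algorithm 2.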


Binary Graph Operators We also support set-theoretical operators to
determine the union (combination operator), intersection (overlap) and
difference (exclusion) of two graphs resulting in a new graph. For
example, the combination operator is useful to merge previously
selected subgraphs into a new graph. The combination of input graphs
$G_1, G_2$ is a graph $G'$ consisting of the vertex set
$V(G') = V(G_1) \cup V(G_2)$ and the edge set
$E(G') = E(G_1) \cup E(G_2)$. For our example graph in Figure 3, the
call db.G[0].combine(db.G[2]) results in the new graph
$G' = (\{v_0, v_1, v_2, v_3, v_4\}, \{e_0, .., e_6, e_{21}\})$.

Similarly, the overlap of graphs $G_1, G_2$ is a graph $G'$ with
vertex set $V(G') = V(G_1) \cap V(G_2)$ and edge set
$E(G') = E(G_1) \cap E(G_2)$. Applying the exclusion operator to $G_1$
and $G_2$ determines all $G_1$ elements that do not occur in $G_2$,
i.e., $V(G') = V(G_1) \setminus V(G_2)$ and
$E(G') = \{(u, v) \in E(G_1) \mid u \in V(G_1) \setminus V(G_2) \land v \in V(G_1) \setminus V(G_2)\}$.
In the example, the call db.G[0].overlap(db.G[2]) returns the graph
G' = ({iq, iq}, {eo, ei}) while the call db.G[0].exclude(db.G[2])
results in G' = ({iq}, $\emptyset$).
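The three binary graph operators follow directly from the set definitions above. The sketch below is a hedged illustration in Python: the `Graph` tuple, the `EDGE_SPACE` lookup and the two sample graphs are assumptions and are unrelated to Figure 3. Note that exclusion keeps only those edges whose source and target both remain in the resulting vertex set.

```python
from typing import NamedTuple


class Graph(NamedTuple):
    vertices: frozenset  # vertex ids
    edges: frozenset     # edge ids


# Hypothetical edge space: edge id -> (source vertex id, target vertex id).
EDGE_SPACE = {0: (0, 1), 1: (1, 0), 2: (1, 2), 3: (4, 0)}


def combine(g1, g2):
    """Union of vertex sets and edge sets."""
    return Graph(g1.vertices | g2.vertices, g1.edges | g2.edges)


def overlap(g1, g2):
    """Intersection of vertex sets and edge sets."""
    return Graph(g1.vertices & g2.vertices, g1.edges & g2.edges)


def exclude(g1, g2):
    """Vertices of g1 not in g2, plus the g1 edges running between them."""
    vertices = g1.vertices - g2.vertices
    edges = frozenset(
        e for e in g1.edges
        if EDGE_SPACE[e][0] in vertices and EDGE_SPACE[e][1] in vertices
    )
    return Graph(vertices, edges)


ga = Graph(frozenset({0, 1, 4}), frozenset({0, 1, 3}))
gb = Graph(frozenset({0, 1, 2}), frozenset({0, 1, 2}))

print(combine(ga, gb))
print(overlap(ga, gb))
print(exclude(ga, gb))
```

In this toy data, edge 3 connects vertex 4 to vertex 0. Since vertex 0 is removed by the exclusion, the edge is dropped as well, so no dangling edges remain in the result.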
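Operator | Definition | GrALa notation
--- | --- | ---
Collection operators (unary) | |
Selection $\sigma_\varphi$ | $\mathcal{G}^n \to \mathcal{G}^n$ | collection.select(predicateFunction) : Collection
Distinct | $\mathcal{G}^n \to \mathcal{G}^n$ | collection.distinct() : Collection
Sort by | $\mathcal{G}^n \to \mathcal{G}^n$ | collection.sortBy(key,[:asc|:desc]) : Collection
Top | $\mathcal{G}^n \to \mathcal{G}^n$ | collection.top(limit) : Collection
Collection operators (binary) | |
Union | $(\mathcal{G}^n)^2 \to \mathcal{G}^n$ | collection.union(otherCollection) : Collection
Intersection | $(\mathcal{G}^n)^2 \to \mathcal{G}^n$ | collection.intersect(otherCollection) : Collection
Difference | $(\mathcal{G}^n)^2 \to \mathcal{G}^n$ | collection.difference(otherCollection) : Collection
Graph operators (binary) | |
Combination $\cup$ | $\mathcal{G}^2 \to \mathcal{G}$ | graph.combine(otherGraph) : Graph
Overlap $\cap$ | $\mathcal{G}^2 \to \mathcal{G}$ | graph.overlap(otherGraph) : Graph
Exclusion $-$ | $\mathcal{G}^2 \to \mathcal{G}$ | graph.exclude(otherGraph) : Graph
Graph operators (unary) | |
Pattern Matching $\mu_{G^*,\varphi}$ | $\mathcal{G} \to \mathcal{G}^n$ | graph.match(patternGraph,predicateFunction) : Collection
Aggregation $\gamma_{k,\alpha}$ | $\mathcal{G} \to \mathcal{G}$ | graph.aggregate(propertyKey,aggregateFunction) : Graph
Projection $\pi_{\nu,\epsilon}$ | $\mathcal{G} \to \mathcal{G}$ | graph.project(vertexFunction,edgeFunction) : Graph
Summarization $\varsigma_{g_v,g_e,\gamma_v,\gamma_e}$ | $\mathcal{G} \to \mathcal{G}$ | graph.summarize(vertexGroupingKeys,vertexAggregateFunction,edgeGroupingKeys,edgeAggregateFunction) : Graph
Auxiliary operators | |
Apply $\lambda_o$ | $\mathcal{G}^n \to \mathcal{G}^n$ | collection.apply(unaryGraphOperator) : Collection
Reduce $\rho_o$ | $\mathcal{G}^n \to \mathcal{G}$ | collection.reduce(binaryGraphOperator) : Graph
Call $\eta_{a,P}$ | $\mathcal{G} \cup \mathcal{G}^n \to \mathcal{G} \cup \mathcal{G}^n$ | [graph|collection].callForGraph(algorithm,parameters) : Graph
 | | [graph|collection].callForCollection(algorithm,parameters) : Collection

Table 1: Overview of analytical operators in Gradoop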


Pattern Matching A fundamental operation in graph analytics is the
retrieval of subgraphs matching a user-defined pattern, also referred
to as pattern matching. For example, given a social network scenario,
an analyst may be interested in all pairs of users that are members of
the same forum. We provide the pattern matching operator
$\mu_{G^*,\varphi} : \mathcal{G} \to \mathcal{G}^n$, where the search
pattern consists of a pattern graph $G^*$ and a predicate
$\varphi : \mathcal{G} \to \{true, false\}$. The operator takes a
graph $G$ as input and returns a graph collection
$\mathcal{G}' = \{G' \subseteq G \mid G' \simeq G^* \land \varphi(G') = true\}$
containing all found matches. Generally speaking, the operator finds
all subgraphs of the input graph that are isomorphic to the pattern
graph and fulfill the predicate.

Algorithm 3 shows an example use of our pattern matching operator; the
pattern graph is illustrated in Figure 4. For GrALa, we adopted the
basic concept of describing graph patterns using ASCII characters from
Neo4j Cypher [7], where (a)-e->(b) denotes an edge e that points from
a vertex a to a vertex b. In line 1, we describe a pattern of three
vertices and two edges, which then can be accessed by variables in
isomorphic instances to declare the predicate (e.g., graph.V[$a]).
Property values are accessed using the property key (e.g., v["name"])
or in case of the type label using the reserved symbol :type. In
line 2, the predicate is defined as a function which maps a graph to a
boolean value. In this function, vertices and edges are accessed by
vertex and edge variables and multiple expressions are combined by
logical operators. In our example, we compare vertex and edge types to
constants (e.g., g.V[$a][:type] == "Forum"). In line 3, the match
operator is called for the database graph db of Figure 3 using pattern
and predicate as arguments. For the example, the result collection has
two subgraphs:
$\mathcal{G}' = \{(\{v_0, v_1, v_9\}, \{e_{17}, e_{18}\}), (\{v_2, v_3, v_{10}\}, \{e_{19}, e_{20}\})\}$.


Algorithm 3 Pattern Matching Example
1: pattern = new Graph("(a)<-d-(b)-e->(c)")
2: predicate = (Graph g =>
       g.V[$b][:type]=="Forum" &&
       g.E[$d][:type]=="hasMember" &&
       g.V[$a][:type]=="Person" &&
       g.E[$e][:type]=="hasMember" &&
       g.V[$c][:type]=="Person")
3: result = db.match(pattern,predicate)



Figure 4: Example pattern graph and predicate 
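For intuition, the pattern of Algorithm 3 can be evaluated by brute force on a small graph. The following Python sketch is hard-wired to the single pattern (a)<-d-(b)-e->(c) and is an assumption-laden illustration, not a general subgraph isomorphism algorithm and not Gradoop's operator. The function name `match_forum_members` and the toy graph are hypothetical.

```python
from itertools import permutations

# Hypothetical toy graph: vertex id -> type label,
# edge id -> (source id, target id, type label).
VERTICES = {0: "Person", 1: "Person", 2: "Forum", 3: "Tag"}
EDGES = {
    0: (2, 0, "hasMember"),
    1: (2, 1, "hasMember"),
    2: (2, 3, "hasTag"),
    3: (0, 1, "knows"),
}


def match_forum_members(vertices, edges):
    """Find subgraphs matching (a)<-d-(b)-e->(c) with the type predicate."""
    matches = set()
    # Try every ordered pair of distinct edges as a binding for (d, e).
    for d, e in permutations(edges, 2):
        b, a, d_type = edges[d]
        b2, c, e_type = edges[e]
        # Both edges must start at the same vertex b; a, b, c must differ.
        if b != b2 or len({a, b, c}) != 3:
            continue
        if (vertices[b] == "Forum"
                and d_type == "hasMember" and vertices[a] == "Person"
                and e_type == "hasMember" and vertices[c] == "Person"):
            # Swapped bindings of a and c describe the same subgraph,
            # so matches are stored as sets of vertex and edge ids.
            matches.add((frozenset({a, b, c}), frozenset({d, e})))
    return matches


result = match_forum_members(VERTICES, EDGES)
print(result)
```

Enumerating all edge pairs is quadratic in the number of edges and only workable for tiny graphs; it is meant to show what a match is, not how matching would be executed at scale.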


Aggregation An operator often used in analytical applications is
aggregation, where a set of values is summarized to a single value of
significant meaning. In the EPGM, we support aggregation at the graph
level by providing the operator
$\gamma_{k,\alpha} : \mathcal{G} \to \mathcal{G}$. Formally, the
operator maps an input graph $G$ to an output graph $G'$ and applies
the user-defined aggregation function
$\alpha : \mathcal{G} \to \mathbb{R}$. Thus, the resulting graph is a
modified version of the input graph with an additional property $k$,
such that $\kappa(G', k) \mapsto \alpha(G)$. The resulting property
value depends on the applied aggregation function. Basic aggregation
functions such as count, sum and average are predefined in GrALa and
can be applied on graph, vertex and edge collections (count) and their
properties (sum, average), for example, to calculate the average age
per community in a social network or the financial result of a
business process instance.
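The aggregation operator described above can be sketched as a function that returns a copy of the input graph with one additional property. The dictionary-based graph layout, the helper names and the sample community are assumptions for illustration only.

```python
def aggregate(graph, key, func):
    """Return a copy of the graph with func(graph) stored under property `key`."""
    return {**graph, "props": {**graph["props"], key: func(graph)}}


def count(items):
    """Cardinality of a collection."""
    return len(items)


def average(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical community graph with three person vertices.
community = {
    "V": [{"age": 20}, {"age": 30}, {"age": 40}],
    "props": {"interest": "Databases"},
}

counted = aggregate(community, "vertexCount", lambda g: count(g["V"]))
averaged = aggregate(counted, "avgAge",
                     lambda g: average(v["age"] for v in g["V"]))

print(averaged["props"])
```

Returning a new graph instead of mutating the input keeps the operator free of side effects, so the same input graph can feed several aggregations in one workflow.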

Algorithm 4 shows a simple vertex count example where the computed
cardinality of the vertex set becomes the value of a new graph
property vertexCount.


Algorithm 4 Aggregation Example
g.aggregate(
    "vertexCount", (Graph g => g.V.count()))


Projection The projection operator simplifies a graph representation
by keeping only vertex and edge properties necessary for further
processing. Furthermore, it is possible to modify (e.g., rename)
properties of interest. For this purpose, the projection operator
$\pi_{\nu,\epsilon} : \mathcal{G} \to \mathcal{G}$ applies the
bijective projection functions $\nu : \mathcal{V} \to \mathcal{V}$ and
$\epsilon : \mathcal{E} \to \mathcal{E}$ to an input graph $G$, and
outputs the graph $G'$ where $V(G') = \{\nu(v) \mid v \in V(G)\}$,
$E(G') = \{\epsilon(e) \mid e \in E(G)\}$ and $G \simeq G'$ (i.e., the
input and output graphs are isomorphic). The user-defined projection
functions are able to modify type labels as well as property keys and
values of vertices and edges, but not their structure. All properties
not specified in the projection functions are removed.

Algorithm 5 shows an example GrALa script to project the community
graph $G_0$ in Figure 3 to a simplified version shown in Figure 5. The
vertex function in line 1 determines that all vertex properties are
removed, except vertex property "city", which is renamed to "from".
Further on, all vertices in the projected graph obtain the value of the
former "name" property as label. The edge function in line 2 expresses
that projected edges show only the original edge (type) labels while
all edge properties are removed. In line 3, the projection operator is
called on the input graph db.G[0] using the vertex and edge functions
as arguments. The identifiers in the resulting new graph are temporary
(e.g., p0) as projected graphs are typically reused in another
operation. However, as stated above, it is also possible to persist
temporary graphs.
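
The effect of the two projection functions can be sketched in Python. This is an illustration under assumptions, not Gradoop code: vertices and edges are plain dictionaries, `project` is a hypothetical helper, and the sample data is made up rather than taken from the paper's figures.

```python
def project(vertices, edges, vertex_func, edge_func):
    """Apply the projection functions to every vertex and edge.

    vertices: {vertex_id: {"label": str, "props": dict}}
    edges:    [(source_id, target_id, {"label": str, "props": dict})]
    Ids and endpoints are preserved, so the result is isomorphic to the
    input; only labels and properties change.
    """
    new_vertices = {vid: vertex_func(v) for vid, v in vertices.items()}
    new_edges = [(src, dst, edge_func(e)) for src, dst, e in edges]
    return new_vertices, new_edges


# Illustrative data (hypothetical values).
vertices = {
    0: {"label": "Person",
        "props": {"name": "Alice", "city": "Leipzig", "age": 23}},
    1: {"label": "Person",
        "props": {"name": "Bob", "city": "Dresden", "age": 30}},
}
edges = [(0, 1, {"label": "knows", "props": {"since": 2014}})]


def vertex_func(v):
    # Former "name" becomes the label; "city" is kept and renamed to "from".
    return {"label": v["props"]["name"], "props": {"from": v["props"]["city"]}}


def edge_func(e):
    # Keep only the edge type label; drop all edge properties.
    return {"label": e["label"], "props": {}}


proj_vertices, proj_edges = project(vertices, edges, vertex_func, edge_func)
```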


Algorithm 5 Projection Example
1: vertexFunc = (Vertex v =>
     new Vertex(v["name"], {"from":v["city"]}))
2: edgeFunc = (Edge e => new Edge(e[:type], {}))
3: projGraph = db.G[0].project(vertexFunc,edgeFunc)
Figure 5: Example projection of $G_0$ from Figure 3


Summarization The summarization operator determines a structural
grouping of similar vertices and edges to condense a graph and thus to
help to uncover insights about patterns hidden in the graph [63, 64].
It can also be used as an optimization step to reduce the graph size
with the intent to facilitate complex graph algorithms, e.g.,
multi-level graph partitioning [33]. The graph summarization operator
$\varsigma_{g_v,g_e,\gamma_v,\gamma_e} : \mathcal{G} \to \mathcal{G}$
represents every vertex group by a single vertex in the summarized
graph; edges between vertices in the summary graph represent a group of
edges between the vertex group members of the original graph.
Summarization is defined by specifying grouping keys $g_v$ and $g_e$
for vertices and edges, respectively, similar to GROUP BY in SQL. These
grouping keys are sets of property keys and may also include the type
label $\tau$ (or :type in GrALa). Additionally, the vertex and edge
aggregation functions $\gamma_v : \mathcal{V}^n \to \mathcal{V}$ and
$\gamma_e : \mathcal{E}^n \to \mathcal{E}$ are used to compute
aggregated property values for grouped vertices and edges, e.g., the
average age of person groups or the number of group members, which can
be stored at the summarized vertex or edge.

Algorithm 6 shows an example application of our summarization operator
using GrALa. The goal is to summarize persons in our graph of Figure 3
according to the city they live in and to calculate their average age.
Furthermore, we want to group both the edges between users in different
cities as well as edges between users that live in the same city. The
result of the operator is shown in Figure 6. In line 1, we use the
combine operator to form a single graph containing all persons and
their relationships to each other; this will be the graph to summarize.
In line 2, we define the vertex grouping keys. In this case, we want to
group vertices by type label :type and property key "city". Edges are
only grouped by type label (line 3). Grouping keys and values are
automatically added to the resulting summarized vertices and edges. In
lines 4 and 5, we define the vertex and edge aggregation functions.
Both receive the summarized entity (i.e., vSum, eSum) and the set of
grouped entities (i.e., vertices, edges) as input. The vertex function
applies the aggregate function average(key) on the set of grouped
entities to compute the average age. The result is stored as a new
property avg_age at the summarized vertex. The edge function counts the
grouped edges and adds the resulting value to the summarized edge. In
line 6, the summarize operator is called using the predefined sets and
functions as arguments.
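
The grouping logic can be sketched in Python as follows. This is a minimal illustration under assumptions, not the Gradoop implementation: graphs are plain dictionaries and lists, `summarize`, `avg_age` and `count_edges` are hypothetical helpers, grouping keys are ordered lists rather than sets, and the sample data is made up.

```python
from collections import defaultdict


def summarize(vertices, edges, vertex_keys, vertex_agg, edge_keys, edge_agg):
    """Group vertices and edges by key values and aggregate each group.

    vertices: {vertex_id: properties}
    edges:    [(source_id, target_id, properties)]
    Returns summarized vertices keyed by their grouping values and a
    list of summarized edges between those vertex groups.
    """
    group_of = {}
    vertex_groups = defaultdict(list)
    for vid, props in vertices.items():
        key = tuple(props.get(k) for k in vertex_keys)
        group_of[vid] = key
        vertex_groups[key].append(props)

    sum_vertices = {}
    for key, members in vertex_groups.items():
        v_sum = dict(zip(vertex_keys, key))  # grouping keys are kept
        vertex_agg(v_sum, members)
        sum_vertices[key] = v_sum

    # Edges are grouped by source group, target group and edge key values.
    edge_groups = defaultdict(list)
    for src, dst, props in edges:
        key = (group_of[src], group_of[dst],
               tuple(props.get(k) for k in edge_keys))
        edge_groups[key].append(props)

    sum_edges = []
    for (src_key, dst_key, values), members in edge_groups.items():
        e_sum = dict(zip(edge_keys, values))
        edge_agg(e_sum, members)
        sum_edges.append((src_key, dst_key, e_sum))
    return sum_vertices, sum_edges


def avg_age(v_sum, members):
    v_sum["avg_age"] = sum(m["age"] for m in members) / len(members)


def count_edges(e_sum, members):
    e_sum["count"] = len(members)


# Illustrative data (hypothetical values).
people = {
    0: {"type": "Person", "city": "Leipzig", "age": 20},
    1: {"type": "Person", "city": "Leipzig", "age": 30},
    2: {"type": "Person", "city": "Dresden", "age": 40},
}
knows = [(0, 1, {"type": "knows"}), (1, 0, {"type": "knows"}),
         (0, 2, {"type": "knows"})]

sum_vertices, sum_edges = summarize(
    people, knows, ["type", "city"], avg_age, ["type"], count_edges)
```

In this sample, the two Leipzig persons collapse into one summarized vertex, the two edges between them become one loop edge on that vertex, and the edge to Dresden becomes a single edge between the two groups.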


Algorithm 6 Summarization Example
1: personGraph =
     db.G[0].combine(db.G[1]).combine(db.G[2])
2: vertexGroupingKeys = {:type,"city"}
3: edgeGroupingKeys = {:type}
4: vertexAggFunc = (Vertex vSum, Set vertices =>
     vSum["avg_age"] = vertices.average("age"))
5: edgeAggFunc = (Edge eSum, Set edges =>
     eSum["count"] = edges.count())
6: sumGraph = personGraph.summarize(
     vertexGroupingKeys,vertexAggFunc,
     edgeGroupingKeys,edgeAggFunc)
Figure 6: Example summarization

Auxiliary Operators In addition to the fundamental graph and graph
collection operators, advanced graph analytics often requires the use
of application-specific graph mining algorithms. One application can be
the extraction of subgraphs that cannot be achieved by pattern
matching, e.g., the detection of communities in a social network [25]
or business transactions [33]. Further on, applications may require
algorithms to detect frequent subgraphs or for statistical evaluations
to select significant patterns. To support the plug-in of such external
algorithms, we provide generic call operators, which may have graphs
and graph collections as input or output, formally
$\eta_{a,P} : \mathcal{G} \cup \mathcal{G}^n \to \mathcal{G} \cup \mathcal{G}^n$.
Depending on the output type, we distinguish between so-called
callForGraph (single graph result) and callForCollection operators.
Algorithm 7 shows the use of callForCollection on a single input graph.
The operator arguments are a symbol $a$ to set the executed algorithm
(e.g., :CommunityDetection) and a set of algorithm-specific parameters
$P$. In the example, a graphPropertyKey needs to be supplied to
determine which graph property should store the computed community id.
The resulting collection communities contains all logical graphs
computed by the algorithm and can be used for subsequent analysis.


Furthermore, it is often necessary to execute a unary graph operator on
more than one graph, for example to calculate an aggregated value for
all graphs in a collection. Not only the previously introduced
operators aggregation, projection and summarization, but all other
operators with single graphs as input and output (i.e.,
$o : \mathcal{G} \to \mathcal{G}$) can be executed on each element of a
graph collection using the apply operator
$\lambda_o : \mathcal{G}^n \to \mathcal{G}^n$. For an input graph
collection, the specified operator is applied to every graph and the
result is added to a new output graph collection. Algorithm 8
demonstrates the apply function in combination with the aggregate
operator. The latter is applied on all logical graphs in the database,
represented by db.G. The result can be seen in our example graph in
Figure 3, where each logical graph has an additional property for the
vertex count.


Algorithm 7 Call Example
1: communities = graph.callForCollection(
     :CommunityDetection,
     {"graphPropertyKey":"community"})


Algorithm 8 Apply Example
1: db.G.apply(Graph graph =>
     graph.aggregate(
       "vertexCount",(Graph g => g.V.count())))


Algorithm 9 Reduce Example
1: totalMerge = db.G.reduce(
     (Graph g, Graph f => g.combine(f)))


Lastly, in order to apply a binary operator on a graph collection, we
adopt the reduce operator often found in programming languages and also
in parallel processing frameworks such as MapReduce. The operator takes
a graph collection and a binary graph operator as input, formally
$\rho_o : \mathcal{G}^n \to \mathcal{G}$. The binary operator
$o : \mathcal{G}^2 \to \mathcal{G}$ is initially applied on the first
pair of elements of the input collection, which results in a new graph.
This result graph and the next element from the input collection are
then the new arguments for the binary operator and so on. In this way,
the binary operator is applied on pairs of graphs until all elements of
the input collection are processed and a final graph is computed. In
Algorithm 9, we call the reduce operator parametrized with the combine
operator on all logical graphs in the database in Figure 3. The final
graph contains all persons of the three communities including their
relationships to each other.
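
The apply and reduce operators correspond to a map and a left fold over the collection. The following Python sketch illustrates this under assumptions: graphs are dictionaries of vertex and edge sets, and `apply`, `reduce_collection`, `combine` and `with_vertex_count` are hypothetical helpers, not Gradoop functions.

```python
from functools import reduce


def combine(g, h):
    """Union of the vertex and edge sets of two graphs."""
    return {"V": g["V"] | h["V"], "E": g["E"] | h["E"], "props": {}}


def apply(collection, unary_op):
    """Run a unary graph operator on every graph of a collection."""
    return [unary_op(g) for g in collection]


def reduce_collection(collection, binary_op):
    """Fold a graph collection into one graph, pair by pair."""
    return reduce(binary_op, collection)


def with_vertex_count(g):
    """Unary operator: add the vertex count as a graph property."""
    return {**g, "props": {**g["props"], "vertexCount": len(g["V"])}}


# Illustrative collection of three overlapping graphs.
graphs = [
    {"V": {"a", "b"}, "E": {("a", "b")}, "props": {}},
    {"V": {"b", "c"}, "E": {("b", "c")}, "props": {}},
    {"V": {"c", "d"}, "E": {("c", "d")}, "props": {}},
]

counted = apply(graphs, with_vertex_count)
total_merge = reduce_collection(graphs, combine)
```

Since no initial graph is passed to `functools.reduce`, this sketch raises a `TypeError` for an empty collection; a real operator would have to define that case explicitly.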
































4 Distributed Graph Store 

The distributed graph store is a fundamental element of the Gradoop
framework. Its main purpose is to manage persistent EPGM databases by
providing methods to read and write graphs, vertices and edges.¹ It
further serves as data source and data sink for the operator
implementations.

The main requirement for a suitable implementation of the Gradoop graph
store is supporting efficient access to very large EPGM databases with
billions of vertices and edges including their respective properties.
Graphs of that size can take up to multiple petabytes in space and thus
require a distributed store that can handle such amounts of data.
Furthermore, as already mentioned in Section 2, there should be
different options to physically partition the graph data to ensure both
load balancing as well as data locality with minimal communication
overhead for graph processing. We also aim at supporting time-based
graph analytics, so that the store should support data versioning of
the graph structure as well as of properties. Finally, the store should
provide fault tolerance against hardware failures and prevent data loss
through data replication.

As there is currently no system to store and manage graphs that
supports our data model, we chose Apache HBase [1] as the technological
platform for our distributed graph store. HBase is built on top of the
Hadoop distributed file system (HDFS) and implements a distributed,
persistent, multidimensional map. It can store large amounts of
structured and semi-structured data across a shared-nothing cluster and
provides fast random reads and writes on that data to applications.
Similar to relational databases, HBase organizes data inside tables
that contain rows and columns. Unlike in the relational model, the
table layout is not static as each row can have a very large number of
different columns within column families. This leads to a very flexible
storage layout optimized for sparse data and fits well with the EPGM,
where each element can have various properties without following a
global schema.

The most basic unit in HBase is a cell, which is identified by row key,
column family, column and timestamp. Column families allow the grouping
of columns based on their access characteristics and can be used to
apply different storage features on them (e.g., compression or
versioning). The timestamp enables data versioning at the cell level,
which also supports our requirements. HBase does not offer support for
data types; instead, all values including the row key, column family,
column and cell are represented by byte arrays, leaving
(de-)serialization of values to the application. To provide horizontal
scalability in terms of data size and parallel data access, HBase
partitions tables into so-called regions and distributes them among
cluster nodes. Built upon HDFS, a distributed, fault-tolerant file
system, HBase also supports automatic failure handling through data
replication.

1 API documentation is beyond the scope of this paper but will be
provided in the documentation on our project website www.gradoop.com
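
The cell coordinates and the versioning behavior can be pictured with a small in-memory model. The Python sketch below is a toy illustration only; `CellStore` and its methods are hypothetical and do not correspond to the HBase client API.

```python
from collections import defaultdict


class CellStore:
    """Toy model of versioned cells; NOT the HBase client API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, column family, column) -> [(timestamp, value)],
        # kept sorted with the newest version first.
        self._cells = defaultdict(list)

    def put(self, row, family, column, value, timestamp):
        for part in (row, family, column, value):
            if not isinstance(part, bytes):
                raise TypeError("the store only handles byte arrays")
        versions = self._cells[(row, family, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions

    def get(self, row, family, column, as_of=None):
        """Return the newest value, or the newest one not after `as_of`."""
        for timestamp, value in self._cells.get((row, family, column), []):
            if as_of is None or timestamp <= as_of:
                return value
        return None


store = CellStore(max_versions=2)
store.put(b"0-1", b"properties", b"city", b"Leipzig", timestamp=1)
store.put(b"0-1", b"properties", b"city", b"Dresden", timestamp=2)
latest = store.get(b"0-1", b"properties", b"city")
earlier = store.get(b"0-1", b"properties", b"city", as_of=1)
```

Reading with `as_of` returns the value that was current at that timestamp, which is the mechanism a time-based analysis would rely on, as long as the requested version has not been dropped by the version limit.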

Apache HBase thus already fulfills most of our stated requirements, as
it provides distributed management of large quantities of sparse data,
data versioning and fault tolerance. A remaining challenge is to
suitably map the EPGM to the data model provided by HBase. Furthermore,
we should exploit the partitioning options of HBase for effective graph
partitioning. In the following, we discuss our current implementation
choices.

Graph Layout Our current approach is straightforward: we represent the
database graph as an adjacency list [TS] and store all vertices inside
a single table (i.e., the vertex table). Logical graphs are maintained
in an additional graph table. This approach gives us fast access to
vertices including their properties and edges. It also gives us the
possibility to quickly retrieve graphs including their corresponding
vertex and edge sets.

Vertex table The vertex space $\mathcal{V}$ and the edge space
$\mathcal{E}$ of an EPGM database are stored in the vertex table. Each
row contains all information regarding one vertex, i.e., vertex
properties, incident edges including their properties and references to
the graphs that contain the vertex. Typically, an operator
implementation (e.g., in MapReduce) loads multiple rows from HBase and
applies an algorithm on them. We store edges denormalized to give the
operator implementations a holistic view on a vertex. By doing so, we
avoid expensive join computation between the vertex table and a
dedicated edge table during graph processing. Furthermore, given that
HBase offers fast random access at the row level, our vertex store
layout is advantageous for graph traversals as loading all incident
edges of a vertex can be done in constant time.

Figure 7 shows a schematic representation of the vertex store omitting
the time dimension. The graph contains three vertices resulting in
three rows. Each row is identified by a row key which is the primary
key within the table and composed of a partition id and a vertex id.
Multiple rows may share the same partition id, but each vertex must
have a unique vertex id which is either provided during data import or
generated by the graph store. We will explain graph partitioning in
more detail below.

Figure 7: Vertex table containing the subgraph ({uo, Vi, ug}, {eg, ei, eis, ei7, eis}) of Figure 3

Vertex data is further separated into four column families for two
reasons. First, we assume that not all vertex data has the same access
characteristics: while edges are frequently accessed in many graph
algorithms, vertex properties on the other hand are only needed to
evaluate predicate or aggregate functions. Second, HBase storage and
tuning features are applied at the column family level, for example,
compression requires similar column size characteristics to work more
efficiently.

The column family meta contains three columns at most. While the
obligatory column type stores the type label encoded by an id (e.g.,
Person is represented by 0), the second column graphs stores the ids of
graphs containing the vertex. The third column idx stores an index
which is used when creating outgoing edges (see below). If the vertex
is not contained in any logical graph or has no outgoing edges, the
particular columns do not exist and thus require no storage space.

The second column family properties stores the vertex attributes. The
number of grouped columns may differ significantly between rows as this
depends solely on the vertex instance. The property key (e.g., name) is
serialized as the column identifier while the property data type and
the property value (e.g., (5, Alice)) are stored in the cell. As HBase
solely handles byte arrays, the graph store adds support for all
primitive data types (e.g., String is represented by 5). However, the
property key does not enforce a specific data type for the associated
value. Furthermore, data versioning is realized at the cell level and
the number of versions is configurable in the Gradoop settings.
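
One way to picture a typed cell value is a single type byte followed by the serialized payload. The Python sketch below is an assumption-laden illustration, not the byte layout used by the Gradoop store: only the id 5 for String comes from the text, while `TYPE_LONG = 1`, the 8-byte big-endian integer encoding and both helper functions are hypothetical.

```python
import struct

# Type ids: 5 for String is stated in the text; the id for integers is
# an assumption made for this sketch only.
TYPE_LONG = 1
TYPE_STRING = 5


def encode_value(value):
    """Serialize a property value as one type byte plus its payload."""
    if isinstance(value, bool):
        raise TypeError("unsupported type in this sketch: bool")
    if isinstance(value, int):
        return bytes([TYPE_LONG]) + struct.pack(">q", value)
    if isinstance(value, str):
        return bytes([TYPE_STRING]) + value.encode("utf-8")
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def decode_value(cell):
    """Inverse of encode_value: read the type byte, then the payload."""
    type_id, payload = cell[0], cell[1:]
    if type_id == TYPE_LONG:
        return struct.unpack(">q", payload)[0]
    if type_id == TYPE_STRING:
        return payload.decode("utf-8")
    raise ValueError(f"unknown type id {type_id}")


column = "name".encode("utf-8")  # property key becomes the column identifier
cell = encode_value("Alice")     # type id and value go into the cell
```

Because the type travels with each value rather than with the key, two rows can store values of different types under the same property key, which matches the schema-free behavior described above.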

The remaining two column families store the incident edges of the
vertex. Analogously to properties, the number of columns may vary
significantly between rows. To enable efficient traversals in any
direction, we currently store both outgoing and incoming edges per
vertex. This leads to data redundancy as each edge has to be stored
twice. However, the graph repository guarantees data consistency when
updating edges. Each column in both column families serializes a single
edge. The column stores an edge identifier, while the cell stores the
edge properties. An edge identifier, e.g., (2,0-1,0), contains the edge
type label (e.g., knows is represented by 2), the id of the opposite
vertex (e.g., 0-1) and an index which is unique at the start vertex.
The opposite vertex identifier refers to the start or end vertex of the
edge depending on its direction. The edge index allows the definition
of parallel edges. If an outgoing edge is created, the next available
index is read from the idx column and incremented afterwards. The graph
store automatically adds the corresponding incoming edge at the target
vertex using the same edge identifier with switched vertex ids. Edge
properties are stored as a list of tuples, e.g., [(since, (0,2014))],
where each tuple contains the property key, type and value.
Consequently, reading a single edge property requires the
deserialization of all edge properties. We decided to store edge
properties differently from vertex properties as edges typically have
significantly fewer properties than vertices. Similar to vertex
properties, edge properties are versioned.
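
The interplay of the idx counter and the mirrored incoming edge can be sketched as follows. This Python model is illustrative only: rows are nested dictionaries, `new_vertex_row` and `add_edge` are hypothetical helpers, and starting the index at 0 is an assumption suggested by the example identifier (2,0-1,0).

```python
def new_vertex_row():
    """One row of the vertex table, reduced to what edge handling needs."""
    return {"meta": {}, "out": {}, "in": {}}


def add_edge(table, source, target, label_id, properties=None):
    """Store an edge at both endpoints and return its index.

    The index is taken from the source row's idx counter, so it is
    unique per start vertex and parallel edges get distinct identifiers.
    """
    properties = list(properties or [])
    source_row, target_row = table[source], table[target]
    index = source_row["meta"].get("idx", 0)
    source_row["meta"]["idx"] = index + 1
    # Same identifier on both sides, with the opposite vertex id swapped.
    source_row["out"][(label_id, target, index)] = properties
    target_row["in"][(label_id, source, index)] = properties
    return index


KNOWS = 2  # label id for "knows", as in the example identifier (2,0-1,0)
table = {"0-0": new_vertex_row(), "0-1": new_vertex_row()}
first = add_edge(table, "0-0", "0-1", KNOWS, [("since", (0, 2014))])
second = add_edge(table, "0-0", "0-1", KNOWS)
```

The second call creates a parallel edge between the same two vertices; it is distinguishable from the first only through its index.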

Graph table The set of logical graphs $\mathcal{G}$ of an EPGM database
is stored in a second table, the graph table. Each row in that table
contains all information regarding one graph, i.e., references to the
vertices and edges it contains, a type label and properties. As
illustrated in Figure 8, each row represents a single logical graph
identified by a unique graph id and described by three column families.
Similar to the vertex store, each row contains the column families meta
and properties. While the former consists of the type label and a list
of vertex identifiers contained in the graph, the latter stores graph
properties in the same way as described for the vertex store. The third
column family edges stores all edges that are incident to the vertices
contained in the logical graph. Each column stores a vertex identifier
and the corresponding cell contains its outgoing edges belonging to the
logical graph. This is necessary, as not all incident edges of a vertex
may be contained in a logical graph. Furthermore, we can exploit the
versioning features of HBase to load snapshots of logical graphs at a
given time. While the column vertices stores a versioned list of vertex
identifiers, for each such identifier, we store a versioned list of
incident edges, hence making the construction of structural snapshots
possible.

Figure 8: Graph table containing the three logical graphs ($G_0$, $G_1$, $G_2$) of Figure 3

Graph Partitioning To achieve scalability of data volume and data
access, HBase horizontally splits tables into so-called regions and
distributes those regions across the cluster. Each cluster node handles
one or more regions depending on the available resources. Furthermore,
rows inside a table are physically sorted by their row key, whereby
each region contains a continuous range of rows between a defined start
and end key. Region boundaries are either determined automatically or
can be defined manually when creating a table. We apply the latter case
to the vertex table when it is created and define partition boundaries
upfront. For example, on a cluster with 10 nodes, an administrator may
define 100 regions for the vertex table. Region boundaries are set by
using the partition id, which is also used as the prefix in the row
key.

Solely defining partition boundaries does not guarantee equal data
distribution. To achieve that, the graph store supports partition
strategies that assign a vertex to a region. At the moment, we support
the well-known range and hash partitioning strategies, both requiring a
continuous id space. The former assigns vertices to regions if their
vertex id is in the partition's range, the latter assigns vertices to
regions by applying a modulo function on the vertex id. Neither
strategy minimizes the number of edges between different regions, but
both achieve a balanced data distribution. We are currently working on
implementing more sensible strategies for improved locality of access.
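
Both strategies reduce to simple integer arithmetic on the vertex id. The Python sketch below illustrates them under assumptions: vertex ids form the continuous space 0..n-1, the three functions are hypothetical helpers, and the string row key only mimics the "partition id, vertex id" composition described above.

```python
def range_partition(vertex_id, num_partitions, num_vertices):
    """Assign contiguous blocks of the id space to partitions."""
    block_size = -(-num_vertices // num_partitions)  # ceiling division
    return vertex_id // block_size


def hash_partition(vertex_id, num_partitions):
    """Assign vertices by applying a modulo function on the vertex id."""
    return vertex_id % num_partitions


def row_key(partition_id, vertex_id):
    """Prefix the vertex id with its partition id, e.g. '3-7'."""
    return f"{partition_id}-{vertex_id}"


num_partitions, num_vertices = 4, 20
range_sizes = [0] * num_partitions
hash_sizes = [0] * num_partitions
for vid in range(num_vertices):
    range_sizes[range_partition(vid, num_partitions, num_vertices)] += 1
    hash_sizes[hash_partition(vid, num_partitions)] += 1
```

With 20 vertices and 4 partitions, both strategies place 5 vertices in each partition, but neither looks at the edges, so neighbors may end up in different regions. Note also that a store sorting row keys as byte arrays would need a fixed-width encoding of the partition id; the plain string used here would sort "10-..." before "2-...".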


5 Use Case Evaluation 

In this section, we present an initial evaluation of Gradoop for two
analytical use cases, namely for social network analysis and business
intelligence. We demonstrate the usefulness of the proposed data model
and operators by showing that non-trivial analysis tasks can be
declared in relatively small GrALa scripts. To execute the equivalent
workflows, we used initial implementations of the operators and
generated the graph data by data generators. We first present the two
analysis workflows and then discuss implementation and evaluation
results for different graph sizes.

Social Network Analysis In our first scenario, an analyst is interested
in communities of a social network. As a meaningful representation, she
requires a summarized graph with one vertex per community, the number
of users per community and the number of relationships between the
different communities. Algorithm 10 shows a GrALa workflow achieving
such a summarized graph from a social network. The original social
network graph sng is created using the LDBC-SNB Data Generator 0H2] and
contains different types of vertices and edges. However, communities
should only group vertices of type Person and edges of type knows. So,
in the first step, the relevant subgraph of Person vertices and knows
edges is extracted in lines 1 to 4. In more detail, line 1 defines a
pattern graph describing two vertices connected by one edge and line 2
the corresponding predicate. Then, in line 3, pattern and predicate are
used to match all knows edges between Person vertices. The result is
friendships, a collection of 1-edge graphs, which is subsequently
combined to a single graph knowsGraph utilizing the reduce operator
(line 4). In line 5, an external algorithm :LabelPropagation [37] is
executed to detect communities. We use the callForGraph operator, as we
need a single graph as output, where each vertex has a "community"
property. Finally, the graph is summarized in line 6, where vertices
are grouped by "community". As the set of edge grouping keys is empty,
edges are only grouped by their incident vertices. Both vertices and
edges in the summarized graph provide a "count" property showing the
number of original vertices or edges.


Algorithm 10 Summarized Communities
Input: Social Network Graph sng
Output: Summarized graph summarizedCommunities;
    each vertex represents a community and edges represent
    aggregated links between communities
1: pattern = new Graph("(a)-c->(b)")
2: predicate = (Graph g =>
     g.V[$a][:type] == "Person" &&
     g.E[$c][:type] == "knows" &&
     g.V[$b][:type] == "Person")
3: friendships = sng.match(pattern,predicate)
4: knowsGraph = friendships.reduce(
     Graph g, Graph f => g.combine(f))
5: knowsGraph = knowsGraph.callForGraph(
     :LabelPropagation,{"propertyKey":"community"})
6: summarizedCommunities = knowsGraph.summarize(
     {"community"}, (Vertex vSum, Set vertices =>
       vSum["count"] = vertices.count()),
     {}, (Edge eSum, Set edges =>
       eSum["count"] = edges.count()))


Business Intelligence In our second scenario, an analyst is interested
in common data objects, such as employees, customers or products,
occurring in all high turnover business transaction graphs (business
process executions) [44]. Algorithm 11 shows a corresponding GrALa
workflow. The initial integrated instance graph iig contains all domain
objects and relationships determined by the FoodBroker data generator
[45]. In line 1, a domain-specific algorithm is executed to extract a
collection of business transaction graphs (btgs) using the algorithm
from [44]. Line 2 defines an advanced predicate to select the
revenue-relevant graphs containing at least one vertex with type label
Invoice. Line 3 defines an advanced aggregation function to calculate
the actual revenue per graph. In line 4, multiple operations are
chained. First, only the graphs meeting the predicate are selected and
second, the actual revenue is aggregated and written to the new graph
property "revenue" for all remaining graphs. In line 5, the graphs with
the top 100 revenue aggregates are selected using our sort and top
operators. Finally, the overlap of all subgraphs is determined by
applying our reduce operator in line 6.


Algorithm 11 Top Revenue Business Cases
Input: Integrated instance graph iig
Output: Common subgraph of top 100 revenue business
    transaction graphs topRevBtgOverlap
1: btgs = iig.callForCollection(
     :BusinessTransactionGraphs,{})
2: predicate = (Graph g =>
     g.V.select(Vertex v =>
       v[:type] == "SalesInvoice").count() > 0)
3: aggRevenue = (Graph g =>
     g.V.values("revenue").sum())
4: invBtgs = btgs.select(predicate)
     .apply(Graph g =>
       g.aggregate("revenue",aggRevenue))
5: topRevBtgs = invBtgs
     .sortBy("revenue",:desc).top(100)
6: topRevBtgOverlap = topRevBtgs.reduce(
     Graph g, Graph h => g.overlap(h))
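
The chain of select, aggregate, sort, top and reduce in this workflow can be mirrored in a few lines of Python. The sketch is an illustration under assumptions, not the Gradoop implementation: graphs are dictionaries of vertices only (edges are omitted), all function names are hypothetical, the sample data is made up, and the overlap is taken over the top-k graphs as the stated output of the workflow describes.

```python
from functools import reduce


def has_invoice(graph):
    """Predicate: the graph contains at least one SalesInvoice vertex."""
    return any(v["type"] == "SalesInvoice" for v in graph["V"].values())


def agg_revenue(graph):
    """Aggregation function: sum of all "revenue" vertex properties."""
    return sum(v.get("revenue", 0) for v in graph["V"].values())


def overlap(g, h):
    """Keep only the vertices contained in both graphs."""
    shared = g["V"].keys() & h["V"].keys()
    return {"V": {vid: g["V"][vid] for vid in shared}, "props": {}}


def top_revenue_overlap(btgs, k):
    # select: keep revenue-relevant graphs
    selected = [g for g in btgs if has_invoice(g)]
    # apply + aggregate: store the revenue as a graph property
    aggregated = [
        {**g, "props": {**g["props"], "revenue": agg_revenue(g)}}
        for g in selected
    ]
    # sortBy + top: the k graphs with the highest revenue
    ranked = sorted(aggregated, key=lambda g: g["props"]["revenue"],
                    reverse=True)[:k]
    # reduce + overlap: common subgraph of the top-k graphs
    return reduce(overlap, ranked)


# Illustrative data (hypothetical values).
alice, bob = {"type": "Employee"}, {"type": "Employee"}
btgs = [
    {"V": {"alice": alice,
           "inv1": {"type": "SalesInvoice", "revenue": 500}}, "props": {}},
    {"V": {"alice": alice, "bob": bob,
           "inv2": {"type": "SalesInvoice", "revenue": 300}}, "props": {}},
    {"V": {"bob": bob,
           "inv3": {"type": "SalesInvoice", "revenue": 100}}, "props": {}},
    {"V": {"alice": alice, "bob": bob}, "props": {}},
]
result = top_revenue_overlap(btgs, k=2)
```

In this sample, the two graphs with the highest revenue share only the employee "alice", so she is the single vertex of the resulting overlap graph.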


Implementation and evaluation For the initial evaluation and
proof-of-concept, we implemented the required operators in Giraph and
MapReduce and use the HBase graph store as data source and sink between
operator executions. Before the workflow runs, the generated data set
is loaded into HBase using its MapReduce bulk import. Vertices are
assigned to regions using the range partitioning strategy, which leads
to a balanced distribution. The matching and combination steps are
realized by loading the relevant subgraph from the graph store. We
further implemented the Label Propagation algorithm for the first use
case and the extraction of business transaction graphs for the second
use case in Giraph. Selection, aggregation and summarization have been
implemented in MapReduce. As HBase does not natively support secondary
indexes, we implemented the sort operator by building a secondary index
during workflow execution with MapReduce. The top and overlap operators
access the graph store directly to select relevant graphs and their
elements. Both operators are implemented as regular, non-distributed
Java applications. Generally, the integration of different frameworks,
i.e., HBase, MapReduce and Giraph, could be easily done as they belong
to the same ecosystem and thus share libraries and internal approaches
like data serialization.

Table 2 shows the results of our initial evaluation on a small cluster
of five nodes, each equipped with an Intel Xeon CPU E5-2430, 48 GB RAM
and local disk storage. In both use cases, the data generators were
executed with their default parameters. We adjusted the scale factors
(SF) to generate graphs of different sizes. The resulting sizes are
shown in the table as well as the time to load them from HDFS into the
graph store. We observe that for both use cases the loading times scale
linearly with the graph size. The execution times for the analytical
workflows in the rightmost column also show a linear and thus scalable
behavior w.r.t. the graph sizes. Nevertheless, we observed that the
time to read and write data from HBase needs to be further optimized,
e.g., for Giraph-based operators where about 50% of the processing time
was for loading and distributing the graph. To improve this, we need to
replace Giraph's own partitioning strategies to avoid data transfers
across the cluster while loading the graph.


Datagen      SF     |V|     |E|      Import [s]   Workflow [s]
LDBC-SNB     1      3.6M    21.7M    79           218
LDBC-SNB     10     34M     217M     828          1984
FoodBroker   100    7M      70M      259          234
FoodBroker   1000   70M     700M     2020         1754

Table 2: Statistics for both use cases.
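
The scaling behavior can be checked with simple arithmetic on the values of Table 2. The following Python snippet is only a reader-side calculation; the variable names are arbitrary and the numbers are copied from the table.

```python
# (datagen, scale factor, import seconds, workflow seconds) from Table 2
runs = [
    ("LDBC-SNB", 1, 79, 218),
    ("LDBC-SNB", 10, 828, 1984),
    ("FoodBroker", 100, 259, 234),
    ("FoodBroker", 1000, 2020, 1754),
]

ratios = {}
# Pair each small run with the large run of the same generator.
for small, large in zip(runs[0::2], runs[1::2]):
    name, sf_small, import_small, workflow_small = small
    _, sf_large, import_large, workflow_large = large
    ratios[name] = {
        "scale": sf_large / sf_small,
        "import": round(import_large / import_small, 2),
        "workflow": round(workflow_large / workflow_small, 2),
    }

for name, r in ratios.items():
    print(name, r)
```

For a tenfold increase of the scale factor, the import time grows by a factor of about 10.5 (LDBC-SNB) and 7.8 (FoodBroker), and the workflow time by about 9.1 and 7.5, respectively. With only two measurements per generator this is consistent with roughly linear growth, but it cannot distinguish linear from mildly sub- or super-linear behavior.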


6 Related Work 

We discuss related work on graph data models as well as on systems and
approaches for graph data management and analysis. We also deal with
more specific recent work on graph analytics related to our operators.

A large variety of graph data models has been proposed in the last
three decades HDH3IE3IES], but only two found considerable attention in
graph data management and processing: the resource description
framework (RDF) [34] and the property graph model (PGM) [49] ED]. In
contrast to the PGM, RDF has some support for multiple graphs by the
notion of n-quads [2Tj; its standardized query language SPARQL [28]
also allows queries on multiple graphs. However, the RDF data
representation by triples is very fine-grained and there is no uniform
way to represent richer concepts of the PGM in RDF [57], so that the
distinction of relationships, e.g., (vertex1, edge1, vertex2), type
labels, e.g., (edge1, type, knows), and properties, e.g.,
(vertex1, name, Alice), has to be done at the application level. In
consequence, expressing queries involving structural and value-based
predicates requires non-standard extensions [53].

Gradoop provides a persistent graph store and an API to access its
elements similar to graph database systems, e.g., Neo4j [6] and
Sparksee [8, 42]. In contrast to most graph database systems, the
Gradoop store is built on a distributed storage solution and can
partition the graph data across a cluster. A notable exception is Titan
[5], a commercial distributed graph database supporting different
storage systems, e.g., Apache Cassandra or HBase. Unlike Gradoop, Titan
focuses on transactional graph processing and the storage layout is
built for the PGM. Approaches to store and process RDF data in Hadoop
are surveyed in [51].

MapReduce is heavily used for the parallel analysis of voluminous data
[56] and has also been applied for iterative algorithms E2U3ZI-. GLog
[26] is a promising graph analysis system that extends Datalog with
additional rules for graph querying. GLog queries are translated to a
series of optimized MapReduce jobs. The underlying data model is a
so-called relational-graph table (RG table) storing vertices by nested
attributes. Unlike Gradoop, GLog stores RG tables as HDFS files and
does not address graph partitioning and data versioning in general. It
also lacks more complex graph analytics on multiple graphs and, as
noted by the authors, suffers from general limitations of MapReduce for
iterative algorithms that can be reduced by graph processing systems.

Parallel graph processing systems, such as Pregel [41], Giraph [2],
GraphLab [40] and the recent Pregelix [15], focus on the efficient,
distributed execution of iterative algorithms on big graphs. The
provided data models are generic and algorithms need to be implemented
by user-defined functions. Gradoop can use these systems by mapping its
high-level, declarative operators to their respective generic data
models and user-defined functions. Parallel processing systems such as
Spark [62] and Flink [T] (formerly known as Stratosphere) support
analysis workflows with high-level graph operators, similar to Gradoop.
However, their graph data models and graph operators are limited to
single graphs. For example, Spark GraphX [60] and Flink Gelly support
filter operations on vertex and edge sets to extract a subgraph from a
single graph, but no further analytical operations on graph
collections. Specific graph algorithms, e.g., connected components or
page rank, are implemented as dedicated operators in GraphX and Flink.
The mentioned graph processing systems focus on in-memory processing
and do not provide a persistent, distributed graph store like Gradoop.
Still, we see Spark and Flink as powerful and supportive platforms for
our approach and we plan to make use of them.

Shared memory cluster systems such as Trinity [54] address both online
query processing and offline analytics on distributed graphs. Trinity
demands a strict schema definition and has no support for graph
collections. The system offers an API to access graph elements and
leaves the implementation of analytical operators to the user.
Furthermore, Trinity is an in-memory system with no support for
persistent graph data management.

There are also graph analytics tools built on top of
relational database systems, thereby utilizing their proven
performance techniques for query processing. For example,
Vertexica m offers a vertex-centric query interface for
analysts to express graph queries as user-defined functions
which are executed by a SQL engine. In contrast to Gradoop,
relational graph stores may be less suited to support
schema-flexible graph models such as the EPGM and are not
as well integrated into the Hadoop ecosystem to utilize its
potential for parallel graph mining.

Many publications propose specific implementations related
to some of our operators, although we can only discuss some
of them here. Similar to our summarization operator, Rudolf
et al. [52] describe a visual approach to declaratively
define graph summaries. Liu et al. describe distributed
algorithms for graph summarization [3T)1 . OLAP-like graph
analysis using multiple summaries of a graph is proposed,
among others, in [18, 64]. Some approaches support
heterogeneous graphs [61], grouping by edge attributes [59]
and the generation of a predefined number of vertex groups
[5811GTTI . Summaries are not only useful as a simplified
representation of a graph, but also to optimize queries
[36, 48]. While pattern matching queries mm are a typical
part of existing graph data models, recent work is focusing
on pattern queries in graph collections El, in distributed
graphs [21] and by graph similarity H.
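
The attribute-based grouping that underlies the
summarization approaches mentioned above can be sketched in
a few lines of Python. This is a minimal, single-machine
illustration under our own assumptions, not the Gradoop
operator or any of the cited algorithms; the function name
summarize and the example property "city" are hypothetical.
Vertices are grouped by one property value, and edges are
aggregated between the resulting groups.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize(vertices: Dict[int, Dict[str, str]],
              edges: List[Tuple[int, int]],
              key: str):
    """Group vertices by the property `key` and count edges between groups."""
    # Map every vertex id to the group given by its property value.
    group_of = {v: props[key] for v, props in vertices.items()}
    # One summary vertex per group, annotated with the number of members.
    vertex_counts = Counter(group_of.values())
    # One summary edge per (source group, target group) pair, annotated
    # with the number of original edges it represents.
    edge_counts = Counter((group_of[s], group_of[t]) for s, t in edges)
    return dict(vertex_counts), dict(edge_counts)


# Toy property graph: three vertices with a "city" property, directed edges.
vertices = {
    1: {"city": "Leipzig"},
    2: {"city": "Leipzig"},
    3: {"city": "Dresden"},
}
edges = [(1, 2), (1, 3), (2, 3)]
summary_vertices, summary_edges = summarize(vertices, edges, "city")
```

The summary graph here has two vertices (one per city) and
two edges, each carrying a count as its only aggregate. The
variants cited above differ mainly in which grouping keys
and aggregates they allow.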

7 Conclusions 

We presented an overview of Gradoop, an end-to-end approach
for Hadoop-based management and analysis of graph data. Its
underlying extended property graph data model (EPGM) builds
upon the proven property graph model but extends it with
support for collections of graphs, which we expect to be a
major asset in many analytical applications. The proposed
set of operators provides basic analysis and aggregation
capabilities, graph summarization and collection
processing. All operators as well as the invocation of
graph mining algorithms are usable within the GrALa
language to specify analysis scripts. The Gradoop store is
realized based on HBase and supports scalability to very
large graphs, graph versioning, partitioned storage and
fault tolerance. An initial evaluation shows the
flexibility of the proposed operators to define analytical
workflows in different domains as well as the scalability
of the parallel data import and workflow execution for
different graph sizes.

Gradoop is still in its initial phase, so many parts of the
system need to be completed and optimized, in particular
operator implementations, the upper layers of the
architecture (Figure 2) and data integration workflows. For
the efficient implementation of the workflow execution
layer and operators we will evaluate and possibly utilize
available Flink and GraphX functionality. We will also
address component-specific research questions such as
customizable graph partitioning at different storage layers
(HBase, in-memory) and the optimization of specific
operators and entire workflows. Gradoop will be open source
and made available at www.gradoop.com.


This work is partially funded by the German Federal 
Ministry of Education and Research under project 
ScaDS Dresden/Leipzig (BMBF 01IS14014B). 